Selectivity Estimation for Fuzzy String Predicates in Large Data Sets

Authors

Liang Jin

Chen Li

Abstract

Many database applications have the emerging need to support fuzzy queries that ask for strings that are similar to a given string, such as “name similar to smith” and “telephone number similar to 412-0964.” Query optimization needs the selectivity of such a fuzzy predicate, i.e., the fraction of records in the database that satisfy the condition. In this paper, we study the problem of estimating selectivities of fuzzy string predicates. We develop a novel technique, called Sepia, to solve the problem. It groups strings into clusters, builds a histogram structure for each cluster, and constructs a global histogram for the database. It is based on the following intuition: given a query string q, a preselected string p in a cluster, and a string s in the cluster, based on the proximity between q and p, and the proximity between p and s, we can obtain a probability distribution from a global histogram about the similarity between q and s. We give a full specification of the technique using the edit distance function. We study challenges in adopting this technique, including how to construct the histogram structures, how to use them to do selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We discuss how to extend the techniques to other similarity functions. Our extensive experiments on real data sets show that this technique can accurately estimate selectivities of fuzzy string predicates.

∗ Supported by NSF CAREER Award No. IIS-0238586.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005
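To make the estimated quantity concrete, the following minimal Python sketch computes the exact selectivity of an edit-distance predicate by scanning every record. It is not the Sepia technique described in the abstract; it is only the brute-force baseline that an estimator would try to approximate without a full scan. The function names `edit_distance` and `exact_selectivity`, the sample list `names`, and the threshold value are illustrative assumptions, not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def exact_selectivity(records, query, threshold):
    """Fraction of records within `threshold` edits of `query` (full scan)."""
    if not records:
        return 0.0
    matches = sum(1 for s in records if edit_distance(query, s) <= threshold)
    return matches / len(records)


# Hypothetical toy data set, used only for illustration.
names = ["smith", "smyth", "smithe", "jones", "schmidt"]
print(exact_selectivity(names, "smith", 1))
```

The scan costs one distance computation per record, which is the expense that a histogram-based estimator is meant to avoid at query-optimization time.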

Similar resources

Topic-based Selectivity Estimation for Hybrid Queries over RDF Graphs

The Resource Description Framework (RDF) has become an accepted standard for describing entities on the Web. Many such RDF descriptions are text-rich – besides structured data, they also feature large portions of unstructured text. As a result, RDF data is frequently queried using predicates matching structured data, combined with string predicates for textual constraints: hybrid queries. Evalu...

CXHist : An On-line Classification-Based Histogram for XML String Selectivity Estimation

Query optimization in IBM’s System RX, the first truly relational-XML hybrid data management system, requires accurate selectivity estimation of path-value pairs, i.e., the number of nodes in the XML tree reachable by a given path with the given text value. Previous techniques have been inadequate, because they have focused mainly on the tag-labeled paths (tree structure) of the XML data. For m...

Multi-Dimensional Substring Selectivity Estimation

With the explosion of the Internet, LDAP directories and XML, there is an ever greater need to evaluate queries involving (sub)string matching. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the dimensions. Effective query optimization in this context requires good selectivity estimates. In this paper, we use multi-dimensional count-suffix trees as t...

Fuzzy Inference System Approach in Deterministic Seismic Hazard, Case Study: Qom Area, Iran

Seismic hazard assessment, like many other issues in seismology, is a complicated problem because a variety of parameters affect the occurrence of an earthquake. Uncertainty, which results from vagueness and incompleteness of the data, should be considered in a rational way. Fuzzy methods make it possible to account for such uncertainties. Fuzzy inference system,...

Publication date: 2005
